ABSTRACT

It is argued that judgments in evaluative research
are ultimately subjective, but that good criteria are available to
assess their quality. One of these criteria is the robustness of the
judgments against incompleteness or uncertainty in the data used to
describe the educational system. The use of the robustness criterion
is demonstrated through the case of a recent evaluation project in
which the state of elementary education in The Netherlands was 
evaluated. To test robustness, four different procedures were 
simulated for item removal: (1) scaling; (2) removal of easy items; 
(3) removal of difficult items; and (4) removal of extreme items. The 
robustness study demonstrated that the qualifications used in the 
evaluation project were quite stable under the removal of items from 
the pool by these four methods. Nearly all the qualifications met the 
rigorous criterion of robustness. An appendix discusses the 
independence of the mean observed score of covariation between 
abilities. (Contains 3 tables, 8 figures, and 17 references.) 
(Author/SLD) 

Abstract



The point of view is taken that judgments in evaluative research are ultimately 
subjective but that good criteria are available to assess their quality. One of these 
criteria is robustness of the judgments against incompleteness or uncertainty in the 
data used to describe the educational system. The use of the robustness criterion is 
demonstrated for the case of a recent evaluation project in which the state of 
elementary education in The Netherlands was evaluated.

Robustness of Judgments in Evaluation Research

Typically, the first stage of an evaluation project consists of a careful
description of the state of an educational object or system. In the next stage, the 
state of the system is evaluated through a series of evaluative statements or 
judgments. Examples of such judgments are: "The quality of teaching in the system 
is excellent"; "Too many students in the system do not reach a satisfactory level of 
proficiency in physics"; and "School management is poor". If the goal of the
evaluation project is to serve a reorientation of a policy with respect to the system, 
the judgments usually result in a series of recommendations to improve the 
functioning of the system. 

For the descriptive stage, the standard methodology of empirical research
in the social sciences is available. This methodology includes the use of such
methods as survey and observation as well as various techniques of (multivariate)
descriptive statistical analysis to summarize the results. Though descriptive
statements can be founded on a rigorous methodology, judgments seem to lack this
support. The main reason is the use of such qualifications as "excellent", "not
satisfactory", and "poor" in the examples above. The choice of such qualifications,
as well as their definitions, is a subjective matter. However, subjectivity is not 
necessarily erratic, and criteria for good qualifications do exist. Judgment does not 
imply lack of rationality. 

One criterion for the quality of judgments is consistency. For example, 
suppose that empirical research has shown time and again that certain instructional
measures lead to an increase in the achievements of the students in a given domain,
and that a system to be evaluated scores high on the use of these measures. Then,
ignoring the role of costs as well as the possibility of interaction between factors in
the system, it seems inconsistent to make judgments that provide the former finding
with a negative and the latter with a positive qualification. Such evaluations are
inconsistent in the sense that they imply a world that can never exist. It should be
noted that in this example empirical research was used to show that a set of
qualifications is inconsistent. Empirical research can only provide the evaluator with
objective information about what worlds are possible and what not. It remains a
subjective choice to evaluate one possible world over the other.

Another obvious criterion is explicitness. The criterion of explicitness
includes the requirement that all judgments be based on explicit definitions of the 
qualifications and procedures used in the evaluation. If this requirement is not met, 
the evaluator can never communicate his evaluations to others in a meaningful way. 
Also, it will never be possible to test these evaluations for consistency in the sense 
defined above. 

It is not the purpose of this paper to give an extensive overview of criteria
for the use of qualifications in evaluation research (for a more complete review, see
van der Linden, to appear). Rather, the emphasis is on one criterion of a more
technical nature than the previous examples. The criterion is necessary because
judgments may have to be based on a description of the state of the system which
is incomplete, uncertain, or erroneous due to the quality of the data. An example is
an evaluation project in which the state of some relevant throughput factor is not
precisely known. In such a case, which is certainly not untypical of educational
evaluation, the evaluator may have to base his or her judgments on a best guess as
to the state of this part of the system. An important criterion for the quality of his
or her judgments, then, is robustness. Generally, a judgment is robust if minor 
changes in the description of the state of the system do not lead to changes in the
qualifications used in it. The idea underlying this criterion is obvious: Uncertainty 
about some part of the state of the system is less critical, the less dependent the 
qualifications are on the precise state the part of the system is in. The robustness of 
qualifications is usually assessed through a series of analyses in which changes in 
the values of some of the variables are made to simulate uncertainty about the state
of the system, whereafter it is determined to what extent the qualifications would
have to change. Obviously, robustness analyses are only possible if both the 
qualifications and the procedures leading to them are defined explicitly. 

In the remainder of this paper, the results from a robustness study in a 
recent evaluation project in The Netherlands are reported to illustrate the possible 
contribution of robustness analysis to educational evaluation. The project was run 
by the Committee for the Evaluation of Elementary Education (CEB). In the next 
section, the problem addressed in the study is described. Subsequently, the methods 
of analysis will be given and the results will be discussed. The paper concludes with 
a discussion of the practical implications of the study. 

Introduction to the Problem 

The evaluation committee was appointed by the Dutch Secretary of
Education in 1991. Its mission was to evaluate the state of elementary education in
the Netherlands from 1988-1992. In particular, the interest was in an evaluation of
four different aspects of elementary education in this period, its level of
achievements being one of them. The results of the evaluation were published
recently (Commissie Evaluatie Basisonderwijs, 1994a, 1994b, 1994c, 1994d, 1994e).
A fuller description of the assignment to the committee is given in Janssens (1995).

The committee had to report its findings at a level of aggregation that
would suit a possible reorientation of the current policy of the Ministry of Education
with respect to elementary education. Another constraint was that resources for data
gathering were limited, and that the committee had to use existing sources of
empirical data to perform its evaluation.

To present its evaluation of the achievements, the committee used the item 
material and scales from PPON. In this large-scale program for the assessment of 
educational progress in The Netherlands, which is run by the National Institute for 
Educational Measurement (Cito), the level of achievement in elementary education 
is periodically fathomed. The basic methodology used in PPON to scale the item 
pools and score the achievements is item response theory (IRT). The use of this 
methodology restricts the scaling of the items to the level of homogeneous subsets
of the pool each measuring the same ability. An overview of the number of scales
that were necessary to scale the item pools for the various subjects is given in Table
1.



Table 1 about here 



For a complete review of the methodology used in PPON as well as reports 
of its assessments, the reader should consult van der Schoot (1993), Sijtstra (1992),
Vinjé (1993), van Weerden (1993), Wijnstra (1998, 1990), and Zwarts (1990).

Definition of Qualifications

The selection of the qualifications by which the committee evaluated the 
achievements was guided by various considerations, three of which need to be 
explained here to be able to define the research problems addressed in this paper: 

1. As already mentioned, the evaluation had to be reported at a level of 
aggregation suitable for recommendations on policy decisions. Therefore, 
it was necessary to combine sets of separate PPON scales into higher-level 
measures of achievement. For example, six separate scales for reading 
(Reading Reports; Reading Persuasive Texts; Reading Arguments; Reading
References; and Reading Tables and Graphs) were combined into a single 
measure for Reading Comprehension. As IRT scales were not possible at 
this level of aggregation, the simple number of items correct score was 
used as a measure of achievement. However, this measure can be estimated 
from the scores on the IRT scales underlying the aggregate (see below). 
The number of aggregates in the evaluation is given in the last column of 
Table 1. 

2. A second form of data reduction was also necessary to report the
evaluations. The achievements of the population of students were in the
form of distributions of scores. A usual way of defining qualifications for
distinguishing between "good distributions" and "bad distributions" is in
terms of their moments. Based on displays of the estimated distributions of
the observed scores, the committee opted for qualifications for the first
moments or means of the distributions. The main purpose of inspecting the
displays was to get familiar with the relation between the location of the
means and the shape of the left tails of the distribution. The qualifications
were knowingly selected to be conservative; that is, relatively large
proportions of students in the population had to be at the lower ends of the
achievement scales before an unfavorable qualification applied.

3. Instead of qualifications in the form of a simple good/bad dichotomy, the
committee chose three different qualifications for which the terms
"Satisfactory" (Dutch: voldoende), "Moderate" (Dutch: matig) and
"Unsatisfactory" (Dutch: onvoldoende) were used. As a compromise
between the fact that evaluations in terms of observed scores are dependent
on item pool content and the fact that a single set of qualifications is easier
to communicate, the committee opted for a common definition of
qualifications with adjustments for item pools that were deemed to be too
difficult or too easy.

In fact, the definition of the qualifications was a long process in which such
factors as familiarity with the curriculum, teaching practices, quality of the learning
materials, previous evaluations, and extensive consulting of relevant parties played
an important role. The results are given in Table 2.



Table 2 about here



Estimation of Mean Observed Scores

Two typical distributions of observed scores are given in Figure 1. Both
distributions were estimated using the assumption of a correlation equal to .80
between the abilities on the underlying IRT scales.



Figure 1 about here


The distribution for Calculating was evaluated as "Moderate". Its mean was
just higher than the lower bound for this category but some 13% of the examinees
solved less than one third of the items correctly. The distribution for
Proportions/Percentages was estimated to have a mean in the category
"Unsatisfactory". In this distribution, 36% of the examinees had less than one third
of the items correct.

The means of the observed-score distributions were calculated from the
item parameters estimated in the PPON projects. These estimates were obtained
under the one-parameter logistic model with imputed values for the discrimination
parameter (Verhelst, Glas, & Verstralen, 1994). The ability distributions were scaled
to be normal with mean 250 and standard deviation 30. Under the previous
assumptions, the mean of an observed-score distribution can simply be calculated
from the common marginal ability distribution and the sum of the response
functions. This claim is proved in the Appendix.
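
The calculation of the mean can be illustrated numerically. The following Python fragment is only a sketch under stated assumptions, not the software used in the project: it assumes logistic response functions with a discrimination index `a` and a difficulty `b` expressed directly on the reported ability scale, and it integrates over the normal ability density by Gauss-Hermite quadrature. The function names, the number of quadrature nodes, and all item parameter values are hypothetical.

```python
import numpy as np

def response_prob(theta, a, b):
    """Logistic response function with discrimination index a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def mean_relative_score(a, b, mu=250.0, sigma=30.0, n_nodes=61):
    """Mean proportion-correct score of a pool of items.

    The summed response functions are integrated over a normal ability
    density with mean mu and standard deviation sigma.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Nodes and weights for the weight function exp(-x^2 / 2).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    theta = mu + sigma * nodes          # quadrature points on the ability scale
    w = weights / weights.sum()         # weights normalized to sum to one
    p = response_prob(theta[:, None], a[None, :], b[None, :])
    return float(w @ p.sum(axis=1)) / a.size

# Hypothetical pool of five items (assumed values, for illustration only).
difficulties = [200.0, 225.0, 250.0, 275.0, 300.0]
print(round(mean_relative_score([0.05] * 5, difficulties), 3))
```

Because only the marginal ability density enters the integral, the same function can be applied to items from different scales in an aggregate by simply concatenating their parameters.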

Research Problems 

The decision to use PPON item material and scales entailed two questions
both related to the use of IRT in PPON.

First, though there is national agreement that the blueprints for the item
pools had high content validity and that the sets of items in the pools covered the
blueprints, some of the items were removed from the original pools in the scaling
process. For example, for Arithmetic 4% of the items was removed from a pool of
491 items, whereas for Dutch [...] was removed from a pool of 498 items. These
numbers are not large but important enough to pay attention to. As these items were
removed on the basis of values of psychometric parameters and not of their content,
it seems safe to conclude that:

1. The resulting pools still define the same ability variables, and that these
variables have therefore not lost their validity; and

2. The removal of some of the items from the pools may nevertheless have
had effects on the observed-score distributions, and hence on the judgments
by the committee.

An important question is how serious these possible effects are. 

Second, only the marginal ability distributions were available from PPON. 
As already explained, the choice for the mean as the critical moment of the
distribution of observed scores was based on plots of observed-score distributions.
However, under the assumption of multivariate normality, to be able to plot
observed-score distributions for aggregates of IRT scales, Pearson's correlation
between the abilities must be known. (Remember that this requirement does not hold
for the mean of the distributions.) As the abilities in each aggregate were "close",
and numerous research projects have shown high correlations between subtests
covering different aspects of, for example, language and arithmetic, the assumption
of correlations in the neighborhood of .80 seemed realistic. An important question
is how serious the consequences of violation of this assumption are.

Both questions were addressed in a robustness study. 

Method

Removal of Items 

Four different procedures of item removal were simulated. In each 
procedure, after the removal of an item the mean of the observed-score distribution
was calculated, and the correct qualification from Table 2 was selected.

The following procedures were studied: 

1. Scaling. The pair of items with the smallest difference between their values for
the difficulty parameter was selected, and one item of the pair was chosen at random
and removed from the pool. The mean of the observed-score distribution was
calculated, and the appropriate qualification was identified. The steps were repeated
until the pool was empty. This procedure simulates item analysis in which the range
of the scale values of the items has to remain maximal but redundancies are
removed by eliminating items from subsets that cluster too strongly. The procedure
applies when the ideal is a pool of items with uniformly distributed scale values.

2. Easy items. The item with the smallest value for the difficulty parameter was
removed from the pool, the mean of the observed-score distribution was calculated,
and the appropriate qualification was identified. The steps were repeated until the
pool was empty. This procedure simulates the case where the item pool is
considered too easy.

3. Difficult items. The previous procedure was repeated, but now at each step the
most difficult item was removed.

4. Extreme items. This procedure is a combination of the previous two procedures.
Alternately, the easiest and the most difficult item were removed. This procedure
simulates the case where the item pool is considered to be on target but, for
example, the distribution of abilities of the examinees is expected to have less spread
than the item pool.
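
The four procedures can be sketched in a few lines of Python. The fragment below is an illustration under assumptions rather than the program used in the study: items are given a common, hypothetical discrimination value `a`, the mean is obtained by Gauss-Hermite quadrature over a normal ability density, and the treatment of means that fall exactly on a cut-off score in `qualify` is a choice made here. All function names and parameter values are hypothetical.

```python
import numpy as np

NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(41)
WEIGHTS = WEIGHTS / WEIGHTS.sum()   # normalized weights of a standard normal density

def mean_score(b, a=0.05, mu=250.0, sigma=30.0):
    """Mean relative observed score of a pool with difficulties b."""
    theta = mu + sigma * NODES
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(b)[None, :])))
    return float(WEIGHTS @ p.mean(axis=1))

def qualify(mean, lower_moderate=0.55, lower_satisfactory=0.70):
    """Map a mean relative score onto the three qualifications."""
    if mean > lower_satisfactory:
        return "Satisfactory"
    if mean >= lower_moderate:
        return "Moderate"
    return "Unsatisfactory"

def removal_order(b, procedure, rng):
    """Return the item indices in the order in which they are removed."""
    b = np.asarray(b, dtype=float)
    remaining = list(np.argsort(b))      # indices sorted from easy to difficult
    order, take_easy = [], True
    while remaining:
        if procedure == "easy":
            pos = 0
        elif procedure == "difficult":
            pos = len(remaining) - 1
        elif procedure == "extreme":     # alternate easiest and most difficult
            pos = 0 if take_easy else len(remaining) - 1
            take_easy = not take_easy
        elif procedure == "scaling":     # closest pair, one member at random
            if len(remaining) == 1:
                pos = 0
            else:
                k = int(np.argmin(np.diff(b[remaining])))
                pos = k + int(rng.integers(2))
        else:
            raise ValueError(procedure)
        order.append(remaining.pop(pos))
    return order

def percent_to_change(b, procedure, seed=0):
    """Percentage of items removed before the qualification first changes."""
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    start = qualify(mean_score(b))
    keep = np.ones(b.size, dtype=bool)
    for n_removed, item in enumerate(removal_order(b, procedure, rng), start=1):
        keep[item] = False
        if keep.any() and qualify(mean_score(b[keep])) != start:
            return 100.0 * n_removed / b.size
    return 100.0                         # no change before the pool was empty

# Hypothetical pool of 40 items (assumed difficulties).
pool = np.linspace(180.0, 300.0, 40)
for name in ("scaling", "easy", "difficult", "extreme"):
    print(name, percent_to_change(pool, name))
```

In this sketch a value of 100 means that the qualification did not change before the pool was exhausted.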

Correlation between Abilities 

To assess the robustness of the observed-score distributions with respect to
the correlation between the abilities, a Monte Carlo method was used to generate
observed-score distributions on the sets of items in the aggregates for various values
of the correlation coefficient. As a correlation between the abilities lower than .60
was most unlikely, the following values for the correlation coefficient were used:
.60, .70, .80, and .90.

In the description of the Monte Carlo procedure below, the notation of the
variables is the same as in the Appendix but the indices $j = 1,\ldots,J$ and
$i = 1,\ldots,I$ are now used to denote the abilities and the items in a subset for
the same ability, respectively:

1. For each simulated examinee, the values of the vector of abilities
$(\theta_1,\ldots,\theta_J)$ were drawn from a multivariate normal distribution
with the assumed (common) value of the correlation coefficient.

2. The true scores $(t_1,\ldots,t_J)$ were calculated as

$$t_j = \sum_{i=1}^{I} P_i(\theta_j), \qquad j = 1,\ldots,J,$$

and normed on [0,1].

3. The conditional distributions of $X_j$ given $T_j = t_j$ are generalized binomial.
Their probability functions, $\mathrm{Prob}(X_j \mid t_j)$, were calculated using the
first term in the expansion of the generalized binomial probability function given in
Lord and Novick (1968, sect. 23.10).

4. The probabilities of the number-correct scores, $\sum_{j=1}^{J} X_j$, were calculated.

The last step in the procedure made use of the fact that for a fixed
examinee the observed scores $X_j$, $j = 1,\ldots,J$, were independent.

The accuracy of the approximation in Step 3 was checked against an 
algorithm suggested by Lord and Wingersky (1984) which produces the full 
generalized binomial distribution (see below). 

The procedure was repeated for N = 10,000 examinees. It should be noted,
however, that for each examinee not one realization of Xj given Tj=tj but its full
conditional distribution was generated. The number is thus large enough to guarantee
a smooth and stable result.
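
The Monte Carlo procedure can be sketched as follows. Note that this Python fragment does not reproduce the first-term approximation of Step 3; instead it uses the exact recursion for the generalized binomial distribution in the spirit of Lord and Wingersky (1984), which is the algorithm the approximation was checked against. The common discrimination value `a`, the item difficulties, and the function name are assumptions made for illustration only.

```python
import numpy as np

def observed_score_distribution(b_by_scale, rho, n_examinees=10_000,
                                a=0.05, mu=250.0, sigma=30.0, seed=0):
    """Distribution of the number-correct score on an aggregate of scales.

    b_by_scale holds one array of item difficulties per ability; rho is the
    common correlation between the abilities.
    """
    rng = np.random.default_rng(seed)
    n_scales = len(b_by_scale)
    # Covariance matrix with variance sigma^2 and common correlation rho.
    corr = rho * np.ones((n_scales, n_scales)) + (1.0 - rho) * np.eye(n_scales)
    theta = rng.multivariate_normal(np.full(n_scales, mu), sigma**2 * corr,
                                    size=n_examinees)
    n_items = sum(len(b) for b in b_by_scale)
    dist = np.zeros((n_examinees, n_items + 1))
    dist[:, 0] = 1.0                    # before any item: score 0 with certainty
    for j, b in enumerate(b_by_scale):
        for b_i in b:
            # Success probability of this item for every simulated examinee.
            p = 1.0 / (1.0 + np.exp(-a * (theta[:, j] - b_i)))
            new = dist * (1.0 - p)[:, None]           # item answered incorrectly
            new[:, 1:] += dist[:, :-1] * p[:, None]   # item answered correctly
            dist = new
    # Average the full conditional distributions over the examinees.
    return dist.mean(axis=0)

# Two hypothetical scales of five items each (assumed difficulties).
pools = [np.linspace(200.0, 280.0, 5), np.linspace(210.0, 290.0, 5)]
for rho in (0.60, 0.70, 0.80, 0.90):
    d = observed_score_distribution(pools, rho, n_examinees=2000)
    print(rho, np.round(d, 3))
```

As in the procedure described above, each simulated examinee contributes a full conditional distribution rather than a single realized score, so the averaged distribution is smooth even for moderate numbers of examinees.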



Results

Graphs are used to present the results for the scaling procedure. In Figure
2, the mean observed relative scores for the five aggregates in Arithmetic are
displayed as a function of the proportion of items removed due to scaling. [...]
qualifications defined in Table 2. Generally, the curves follow a flat course,
indicating extreme robustness of the mean with respect to the removal of items due
to scaling. To cross one of the lines, the removal of 91% of the items for Basic
Skills and 100% of the items for Proportions/Percentages was needed. For
Calculating, the percentage was equal to 62%. The percentages for Fractions and
Measurement are lower but still equal an impressive 45% and 33%, respectively.
After these values, the two last curves started moving back and forth between the
two sides of the upper (Fractions) and lower lines (Measurement). This behavior is
typical of mean scores that were close to the borderline between two qualifications,
remained there after removal of the items, but showed small fluctuations.

The results for the aggregates in the other subjects are given in Figures 3
through 6. 



Figures 3-6 about here



The results are generally the same as for Arithmetic. All curves had a flat course,
and, except for Reading English, at least 30-40% of the items had to be removed
before the qualifications changed. The case of Reading English is an interesting one.
The curve was flattest of all curves in Figures 2-6, but the curve coincided with the
upper line nearly perfectly. The same phenomenon was observed for Reading
Comprehension. Its curve was also flat and uniformly close to the line between
"Satisfactory" and "Moderate". Nevertheless, 38% of the items had to be removed
from the pool to change the qualification. At a later stage, the curve moved back to
the original qualification. In its report, the committee made the provision that
important parts of this aggregate were less favorable than the general impression
suggested. Also, uncertainty was expressed due to the fact that data from an
international comparison of achievements in Reading Comprehension had yielded
conflicting information (Commissie Evaluatie Basisonderwijs, 1994a, sect. 5.1).

The results for all four principles of item removal are given in Table 3. The
first column gives the percentages of items that had to be removed for the scaling
procedure. The next three columns present the results for the other item removal
procedures. Obviously, removal of the most difficult or easy items introduced a shift
in the observed-score distributions, and generally the qualifications changed
earlier than in the previous case. Nevertheless, with the exception of Measurement
and Reading Comprehension for the removal of the easiest items and Biology for
the most difficult items, the qualifications were remarkably robust for all aggregates.
In these exceptional cases of change, again the mean observed scores were already
close to the borderline between two classifications for the intact item pool. For
example, for Reading Comprehension the mean relative observed score for the intact
item pool was .71, a result close to the cut-off score of .70 separating
"Satisfactory" from "Moderate" (see Figure 2). The removal of the items with
extreme difficulty values at both ends of the scale had, except for Reading of
English, no noticeable effect on the qualifications. In the majority of the cases,
nearly all items had to be removed before the qualification changed.



Table 3 about here


In Figure 7, two typical observed-score distributions for values of the
correlation coefficient in the range from .60 to .90 are shown. The effect of lowering
the value of the correlation was a small shift of the mode of the distribution to the
center of the scale. (However, remember that this phenomenon does not hold for the
mean of the distribution. This parameter is independent of the value of the
correlation coefficient.) Consequently, the value of the correlation coefficient does
have some effect on the left tail of the distribution, but the effect is not dramatic.
It seems safe to conclude that the relation between the mean and the left tail of the
distributions observed by the committee does not change much in the neighborhood
of r = .80.



Figure 7 about here

As already observed, in Step 3 of the procedure for generating the
observed-score distributions, an approximation to the generalized binomial
distribution of X given T=t was made. The quality of the approximation was
checked by comparing its results against those obtained for the exact distributions
using the computer program AAPMOMT which implements the algorithm by Lord
and Wingersky (1984) referred to earlier. The results were always virtually identical.
Figure 8 gives the distributions for the same two aggregates as in Figure 7.
The approximation proved to be excellent; the difference between the results of the
two methods is hardly discernible.



Figure 8 about here

Discussion 

The main conclusion from the robustness study reported in this paper is that
the qualifications used in the evaluation project are quite stable under the removal
of items from the pool according to the four procedures defined above. Nearly all
of the qualifications thus met a rigorous criterion of robustness.

In this study, the results for the scaling procedure are most important since
this procedure comes closest to the procedure actually used in the PPON projects.
However, it should be noted that the former is an idealized version of the latter, and
that differences between the two may exist. Also, the procedure was applied to the
item pools that were the results from PPON item analyses, and not to the original
pools. Generalizing the findings to the original pools thus involves an element of
extrapolation, albeit that the differences between the sizes of the two kinds of pools
were generally small. Also, the fact that, with a few exceptions, remarkably robust
results were obtained for procedures that deliberately made the item pools easier or
more difficult does lend some support to the claim that this generalization is unlikely
to involve serious bias.

It is emphasized that robustness of qualifications is only one necessary
criterion which judgments in evaluation projects must meet, and that judgments are
not automatically meaningful if they are robust. However, as illustrated in this paper,
if uncertainty exists as to the knowledge base on which the judgments have to be based,
then robustness analysis is an excellent means to assess how serious the
consequences of this uncertainty are.

Appendix

Independence of Mean Observed Score of Covariation between Abilities 

For ease of exposition, the case of two distinct abilities is addressed. Let
$\theta_1$ and $\theta_2$ be these two abilities. The bivariate distribution of the two
abilities is represented by probability density function $f(\theta_1,\theta_2)$, whereas
the marginal distributions of $\theta_1$ and $\theta_2$ are denoted as $f_1(\theta_1)$ and
$f_2(\theta_2)$. Let $X_1$ and $X_2$ be the observed scores on the item sets measuring
$\theta_1$ and $\theta_2$ and $T_1$ and $T_2$ the classical true scores for these
observed scores.

In PPON, the marginal distributions of $\theta_1$ and $\theta_2$ are scaled to have
common marginal densities:

$$f_1(\theta_1) = f_2(\theta_2) = f(\theta).$$

This feature is used in the proof below. The first step in the derivation follows from
classical test theory, whereas the other steps are straightforward. Indices i and j
denote items measuring the first and second ability, respectively. The proof runs as
follows:

$$
\begin{aligned}
E(X_1 + X_2) &= E(T_1 + T_2) \\
&= \iint \Big[\sum_i P_i(\theta_1) + \sum_j P_j(\theta_2)\Big] f(\theta_1,\theta_2)\, d\theta_1\, d\theta_2 \\
&= \int \sum_i P_i(\theta_1) \Big[\int f(\theta_1,\theta_2)\, d\theta_2\Big] d\theta_1
 + \int \sum_j P_j(\theta_2) \Big[\int f(\theta_1,\theta_2)\, d\theta_1\Big] d\theta_2 \\
&= \int \sum_i P_i(\theta_1) f_1(\theta_1)\, d\theta_1 + \int \sum_j P_j(\theta_2) f_2(\theta_2)\, d\theta_2 \\
&= \int \Big[\sum_i P_i(\theta) + \sum_j P_j(\theta)\Big] f(\theta)\, d\theta .
\end{aligned}
$$

Hence, when calculating the mean observed score, possible covariation
between the underlying abilities can be ignored, and the item response functions may
be summed across abilities.

Table 1.

Aggregation of PPON scales in evaluation project

Subject               # Original Scales    # Aggregates

Dutch Language               13                  7
Arithmetic                   27                  5
World Orientation            30                  8
English                       5                  5
Traffic                       2                  1

Note. World Orientation is a combination of subjects. See Table 3.

Table 2.

Definition of qualifications used in evaluation

Qualification       Mean of Score Distribution

Satisfactory        > 70%
Moderate            55% - 70%
Unsatisfactory      < 55%

Note. For item pools judged to be too difficult a downward adjustment of 10% and
5% was made for the lower bounds of Satisfactory and Moderate, respectively.
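
The definition in Table 2, including the adjustment in the note, can be written as a small decision rule. The Python sketch below is illustrative only; the function name, the `too_difficult` flag, and the assignment of means that fall exactly on a bound (70% to "Moderate", 55% to "Moderate") are assumptions, since the table does not specify them. No adjustment for pools judged too easy is included because its size is not reported.

```python
def qualification(mean_percent, too_difficult=False):
    """Qualification for a mean score expressed as a percentage correct."""
    lower_satisfactory = 70.0
    lower_moderate = 55.0
    if too_difficult:
        # Downward adjustment for item pools judged to be too difficult.
        lower_satisfactory -= 10.0
        lower_moderate -= 5.0
    if mean_percent > lower_satisfactory:
        return "Satisfactory"
    if mean_percent >= lower_moderate:
        return "Moderate"
    return "Unsatisfactory"

# The same mean can receive a different qualification after the adjustment.
print(qualification(65.0), qualification(65.0, too_difficult=True))
```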

Table 3.

Percentages of items needed to change the qualifications for the four methods

Subject                            Scaling   Easy          Difficult     Extreme

Arithmetic
  Basic Skills                       91      14            27            100
  Calculating                        62      [illegible]   [illegible]   [illegible]
  Fractions                          45      [illegible]   [illegible]   100
  Proportions/Percentages           100      [illegible]   [illegible]   100
  Measurement                        33      [illegible]   [illegible]   [illegible]

Dutch
  Reading Comprehension              37      [illegible]   100           100
  Listening                          94      [illegible]   100           [illegible]
  Composition                        71      [illegible]   100           100
  Spelling                           39      34            100           100
  Grammar                            79      54            100           100
  Parsing                            71      37            13            100
  Language Reflection               100      32            100           100

World Orientation
  Biology                            41      28            4             26
  Physics                            66      13            17            100
  Regional Geography                 88      26            12            100
  Physical Geography                 91      16            17            100
  Topography                        100      100           17            100
  History                            40      47            100           100
  Spiritual & Religious Movements    78      30            100           39
  Social Relations                   97      20            100           100

English
  Reading                             3      3             100           6
  Listening                          96      41            100           100
  Speaking                           97      27            14            100
  Vocabulary                         59      24            13            100
  Use of Dictionary                 100      75            100           55

Traffic
  Practical Skills                   91      42            100           100

Figure Captions

Figure 1. Estimated observed-score distributions for: (a) Calculating; and (b)
Proportions/Percentages.

Figure 2. Mean observed score as a function of the proportions of items removed
due to scaling for Arithmetic.

Figure 3. Mean observed score as a function of the proportions of items removed
due to scaling for Dutch.

Figure 4. Mean observed score as a function of the proportions of items removed
due to scaling for World Orientation.

Figure 5. Mean observed score as a function of the proportions of items removed
due to scaling for English.

Figure 6. Mean observed score as a function of the proportions of items removed
due to scaling for Traffic.

Figure 7. Estimated observed-score distributions for: (a) Calculating; and (b)
Proportions/Percentages (different correlation between abilities).

Figure 8. Observed-score distributions estimated by: (a) Taylor approximation to
generalized binomial; and (b) exact distribution function.